Smart Paper Notes

Adaptive Stochastic Gradient Descent for Fast and Communication-Efficient Distributed Learning

Serge Kas Hanna, Rawad Bitar, Parimal Parag, Venkat Dasari, Salim El Rouayheb

Category: Machine Learning

2022-08-04

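We consider the setting where a master wants to run a distributed stochastic gradient descent (SGD) algorithm on $n$ workers, each holding a subset of the data. Distributed SGD may suffer from stragglers, i.e., slow or unresponsive workers that cause delays. One solution studied in the literature is to wait, at each iteration, for the responses of only the fastest $k < n$ workers before updating the model, where $k$ is a fixed parameter. The choice of $k$ presents a trade-off between the runtime (i.e., convergence rate) of SGD and the error of the model. To optimize this error-runtime trade-off, we investigate distributed SGD with adaptive $k$, i.e., with $k$ varying throughout the runtime of the algorithm. We first design an adaptive policy for varying $k$ that optimizes this trade-off, based on an upper bound that we derive on the error as a function of the wall-clock time. We then propose and implement an algorithm for adaptive distributed SGD that is based on a statistical heuristic. Our results show that the adaptive version of distributed SGD can reach lower error values in less time than non-adaptive implementations. Moreover, the results show that the adaptive version is communication-efficient: the amount of communication required between the master and the workers is less than that of non-adaptive versions.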
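To make the fastest-$k$ idea concrete, the following is a minimal simulation sketch, not the authors' algorithm. It assumes a one-dimensional least-squares problem, i.i.d. exponential worker response times, and a hypothetical plateau rule (`grow_on_plateau`) that increases $k$ when the loss stops improving. The paper's actual policy and statistical heuristic are not given in the abstract, so every function name and parameter below is illustrative only.

```python
import random

def make_data(num_points=200, seed=1):
    # Synthetic data y = 2x + noise (an assumption made for this illustration).
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(num_points)]
    return [(x, 2.0 * x + rng.gauss(0.0, 0.3)) for x in xs]

def fixed_k(k0):
    # Non-adaptive baseline: always wait for the fastest k0 workers.
    return lambda step, k, losses: k0

def grow_on_plateau(window=20, tol=0.01):
    # Hypothetical adaptive rule: increase k by one when the average loss over
    # the last `window` steps is no longer clearly below the previous window.
    state = {"last_change": 0}
    def rule(step, k, losses):
        ready = len(losses) >= 2 * window and step - state["last_change"] >= 2 * window
        if ready:
            recent = sum(losses[-window:]) / window
            earlier = sum(losses[-2 * window:-window]) / window
            if recent > (1.0 - tol) * earlier:
                state["last_change"] = step
                return k + 1
        return k
    return rule

def fastest_k_sgd(data, n_workers, k_schedule, steps, lr=0.05, seed=0):
    """Simulate fastest-k distributed SGD.

    Returns (final weight, simulated wall-clock time, gradients received).
    """
    rng = random.Random(seed)
    shards = [data[i::n_workers] for i in range(n_workers)]  # round-robin split
    w, clock, messages, k = 0.0, 0.0, 0, 1
    losses = []
    for step in range(steps):
        # Full-data loss, computed here only to drive the simulated schedule.
        losses.append(sum((w * x - y) ** 2 for x, y in data) / len(data))
        k = max(1, min(n_workers, k_schedule(step, k, losses)))
        # Each worker draws a response time; the master keeps the k fastest.
        times = sorted((rng.expovariate(1.0), i) for i in range(n_workers))
        fastest = times[:k]
        clock += fastest[-1][0]  # iteration lasts until the k-th response arrives
        grad = 0.0
        for _, i in fastest:
            x, y = rng.choice(shards[i])  # one-sample stochastic gradient
            grad += 2.0 * (w * x - y) * x
        w -= lr * grad / k
        messages += k
    return w, clock, messages

if __name__ == "__main__":
    data = make_data()
    for name, schedule in [("fixed k=3", fixed_k(3)), ("adaptive", grow_on_plateau())]:
        w, clock, messages = fastest_k_sgd(data, 8, schedule, steps=300)
        print(name, round(w, 3), round(clock, 2), messages)
```

In this sketch a small $k$ keeps each iteration short but makes the averaged gradient noisier, while a large $k$ reduces the noise at the cost of waiting for slower workers. The `messages` counter tallies how many gradients the master receives, which is the quantity a communication comparison between fixed and adaptive schedules would look at.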


In this work, we present an evaluation of smaller BLOOM model variants (350m/560m and 1b3/1b7) on various natural language processing tasks. This includes GLUE - language understanding, prompt-based zero-shot and few-shot text classification and extraction, question answering, prompt-based text generation, and multi-lingual text classification to understand model strengths/weaknesses and behavior. Empirical results show that BLOOM variants under-perform on all GLUE tasks (except WNLI), question-answering, and text generation. The variants bloom for WNLI, with an accuracy of 56.3%, and for prompt-based few-shot text extraction on MIT Movies and ATIS datasets. The BLOOM variants on average have 7% greater accuracy over GPT-2 and GPT-Neo models on Director and Airline Name extraction from MIT Movies and ATIS datasets, respectively.
